Handling Cascading Failures: The Case for Topology-Aware Fault-Tolerance

Authors

  • Soila Pertet
  • Priya Narasimhan
Abstract

Large distributed systems contain multiple components that can interact in sometimes unforeseen and complicated ways; this emergent “vulnerability of complexity” increases the likelihood of cascading failures that might result in widespread disruption. Our research explores whether we can exploit the knowledge of the system’s topology, the application’s interconnections and the application’s normal fault-free behavior to build proactive fault-tolerance techniques that could curb the spread of cascading failures and enable faster system-wide recovery. We seek to characterize what the topology knowledge would entail, quantify the benefits of our approach and understand the associated tradeoffs.

Similar Resources

Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)

Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...

The Relaxed-Ring: a Fault-Tolerant Topology for Structured Overlay Networks

Fault-tolerance and lookup consistency are considered crucial properties for building applications on top of structured overlay networks. Many of these networks use the ring topology for the organization of their peers. The network must handle multiple joins, leaves and failures of peers while keeping the connection between every pair of successor-predecessor correct. This property makes the ma...

A Survey of QoS Multicasting Issues

The recent proliferation of QoS-aware group applications over the Internet has accelerated the need for scalable and efficient multicast support. In this article, we present a multicast “life-cycle” model which identifies the various issues that are involved in a typical multicast session. During the life-cycle of a multicast session, three important events can occur: group dynamics, network dy...

Characteristics, Impact, and Tolerance of Partial Disk Failures

Hard-disk failures are one of the primary causes of data loss in both enterprise storage systems and personal computers. Most disk failures are partial failures, where only some sectors are unavailable due to a latent sector error or some blocks are silently corrupted. This dissertation focuses on all aspects of such partial disk failures – their characteristics, their impact on different syste...

Topological Analysis and Mitigation Strategies for Cascading Failures in Power Grid Networks

Recently, there has been a growing concern about the overload status of the power grid networks, and the increasing possibility of cascading failures. Many researchers have studied these networks to provide design guidelines for more robust power grids. Topological analysis is one of the components of system analysis for its robustness. This paper presents a complex systems analysis of power gr...

Publication date: 2005